Checking installation and loading packages

As usual, we first check for and load our required packages.

# Check if packages are installed, if not install.
if(!require(here)) install.packages('here') #checks if a package is installed and installs it if required.
if(!require(tidyverse)) install.packages('tidyverse')
if(!require(ggplot2)) install.packages('ggplot2')

library(here) #loads in the specified package
library(tidyverse)
library(ggplot2)

Checking the mean of time_on_social from last week

First we want to have another look at the PSYC2001_social-media-data.csv dataset from last week. To do this we first load it in using the same read.csv() function combined with here().

social_media <- read.csv(file = here("Data","PSYC2001_social-media-data.csv")) #reads in CSV files

Next we want to look at the mean of the variable time_on_social. We are going to do this in the same way we did in the last tutorial: first altering all instances of -999 to become NA, and then using the summary() function.

social_media_NA <- social_media %>%
  mutate(time_on_social = na_if(time_on_social,-999)) #mutate alters columns and rows.
                                                      #na_if replaces -999 with NA.

summary(social_media_NA) #provides a summary of all variables in the data. 
##       id                 age        time_on_social      urban    
##  Length:60          Min.   :13.90   Min.   :1.240   Min.   :1.0  
##  Class :character   1st Qu.:15.70   1st Qu.:2.010   1st Qu.:1.0  
##  Mode  :character   Median :16.50   Median :2.410   Median :1.5  
##                     Mean   :16.87   Mean   :2.539   Mean   :1.5  
##                     3rd Qu.:17.43   3rd Qu.:3.047   3rd Qu.:2.0  
##                     Max.   :23.00   Max.   :4.320   Max.   :2.0  
##                                     NA's   :2                    
##  good_mood_likes bad_mood_likes    followers      polit_informed 
##  Min.   : 6.50   Min.   :12.20   Min.   : 61.40   Min.   :0.600  
##  1st Qu.:31.60   1st Qu.:39.08   1st Qu.: 76.47   1st Qu.:1.500  
##  Median :45.90   Median :49.30   Median :116.30   Median :1.800  
##  Mean   :43.04   Mean   :49.84   Mean   :124.76   Mean   :1.858  
##  3rd Qu.:53.40   3rd Qu.:58.75   3rd Qu.:153.75   3rd Qu.:2.200  
##  Max.   :89.20   Max.   :91.20   Max.   :336.50   Max.   :3.400  
##                                                                  
##  polit_campaign  polit_activism 
##  Min.   :0.800   Min.   :0.900  
##  1st Qu.:2.100   1st Qu.:2.400  
##  Median :2.550   Median :2.900  
##  Mean   :2.602   Mean   :2.977  
##  3rd Qu.:3.100   3rd Qu.:3.500  
##  Max.   :4.800   Max.   :5.500  
## 

What is the mean of time_on_social?

Figure 1: Deja Vu


Now we are going to start looking at the new dataset for this week. All the information about this data and its variables is located in the ReadME_What is the sampling distribution of the mean anyway?.txt file. If you have NOT read this yet, please make sure you do!


Activity 1

We are now going to read the dataset that we need for this week into R, into an object called global_social_media. Please use the read.csv() and here() functions to read in the PSYC2001_global-time-on-social-data.csv file in the code block below.

#Use the read.csv() and here() functions to read in the dataset.

global_social_media <- read.csv(file = here("Data","PSYC2001_global-time-on-social-data.csv")) #your code goes here

Checking the UNSW value in this dataset.

Now, let's check whether the UNSW value (reminder: this is U49 from the ReadME file!) matches the mean value we had from the first week.

This should be pretty easy to do, and makes use of the filter() function we used last week. This function keeps only the rows in your dataset that match a certain condition.

global_social_media %>% 
  filter(uni_id == "U49")
##   uni_id mean_time_on_social
## 1    U49                2.54

Yay! The output should match the mean we calculated last week. But what does this mean? Why have we bothered to show you this?

Each value in the new data file is the mean of the time_on_social variable for one of the 500 experiments run (U1-U500), i.e. the data contains 500 sample means of time_on_social. This dataset is the result of repeating a single experiment many times.
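If you want to double-check this yourself, you can recompute the mean directly and compare it with the U49 row (this assumes the social_media_NA object from the start of this tutorial is still in your environment):

```r
# Recompute the mean of time_on_social, ignoring the NA values
mean(social_media_NA$time_on_social, na.rm = TRUE) #na.rm = TRUE drops the NAs before averaging; this should round to 2.54
```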


Vectors and the sample function

We are now going to have a look at what happens to the sampling distribution of the mean as we increase the number of sample means we include (confusing, I know!).

To do this we are going to make use of the sample() function in R. Let's first have a look at what this function does by using the ? syntax. The help page should have already opened when you first knitted the document.

?sample

This function takes a vector as its first argument. What this means is that we cannot just give it the entire dataframe, as it does not know what to do with it. This will result in an error.

sample(global_social_media, size = 10)
## Error in sample.int(length(x), size, replace, prob): cannot take a sample larger than the population when 'replace = FALSE'

But you may be asking now, what is a vector? You can think of a vector as a single column of our dataframe. In the example above, since the function only takes a single column, it gets overwhelmed when we pass it the entire dataframe. Poor function!
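If vectors are new to you, here is a tiny standalone example (the numbers are made up purely for illustration) showing how to build one with c() and hand it to sample():

```r
my_vector <- c(2.1, 3.4, 1.8, 2.9, 4.0) #c() combines values into a vector
length(my_vector) #a vector has a length; here it is 5
sample(my_vector, size = 3) #randomly picks 3 of the 5 values
```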

Figure 2: Specter on a Vector


So we need to pass it only a single column. We can do this by using the $ operator from base R. In essence, the $ operator extracts an element (i.e. a column) from our dataframe.

Let's have a go at using it below:

head(global_social_media$uni_id)
## [1] "U1" "U2" "U3" "U4" "U5" "U6"

What does the sampling distribution of the mean look like with only a few samples?

Great! So we now know how to pass a vector (column) into the sample() function. First let's try extracting a small number of sample means, say 20.

set.seed(1) #ensures we always get the same result from sampling ! 

mean_sample_20 <- sample(global_social_media$mean_time_on_social, size = 20)

mean_sample_20
##  [1] 3.08 1.56 1.82 2.31 3.07 3.48 2.42 2.90 1.90 2.45 3.07 2.31 2.00 2.02 2.31
## [16] 1.75 2.36 3.21 2.35 2.23

This has randomly extracted a sample of 20 means from the mean_time_on_social column of global_social_media.
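A quick aside on set.seed(): sample() is random, so rerunning it would normally give a different 20 means each time. Setting the seed makes the randomness reproducible, which is why everyone in the class gets the same numbers. A minimal illustration (using the simple vector 1:10 rather than our data):

```r
set.seed(1)
first_draw <- sample(1:10, size = 5)

set.seed(1) #reset the seed to the same value
second_draw <- sample(1:10, size = 5)

identical(first_draw, second_draw) #TRUE: the same seed gives the same "random" numbers
```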

Now, what we want to do next is to visualise this data using a histogram. However, there is a problem. The sample() function provides us with a single vector (column), but our plotting function ggplot() only likes dataframes. So we first need to convert our mean_sample_20 vector into a dataframe.

To do this we use the data.frame() function, which is used to create dataframes in R.

mean_sample_20_df <- data.frame(sample_20 = mean_sample_20) #create a dataframe with a column called sample_20 that takes values from our vector

Now let's do some data viz! We can create a histogram using the ggplot() and geom_histogram() functions from last week.

mean_sample_20_df %>% 
  ggplot(aes(x = sample_20)) +
  geom_histogram(fill = "skyblue", colour = "black") +  #fill and colour are Aesthetics. Fill controls the interior colour of shapes whereas colour controls the outline. 
  labs(x = "Time on Social media", y = "Count") #short for "labels", use to label axes and titles.

First, notice that we are using some new aesthetics this week. We use the fill “skyblue” and the colour “black” to control the interior colour and border of our histogram respectively. You will learn more aesthetics that can be used to create nicer looking plots as the weeks go on.

Second, what is the shape of the histogram here? Is it as you expected?


What does the sampling distribution of the mean look like as we add more samples?

Now lets see what happens when we add in more samples.


Activity 2

Are you able to use the sample() and data.frame() functions to create objects with 100, 250, 350 and 500 sample means? Use the code blocks below to do this (hint: this is just replicating what we have done above with some new object names). If you need any help please ask your tutor!

#fill in the code below !

mean_sample_100 <- sample(global_social_media$mean_time_on_social, size = 100) #create a sampling distribution of the mean with 100 samples
  
mean_sample_250 <- sample(global_social_media$mean_time_on_social, size = 250) #create a sampling distribution of the mean with 250 samples
  
mean_sample_350 <- sample(global_social_media$mean_time_on_social, size = 350) #create a sampling distribution of the mean with 350 samples
  
mean_sample_500 <- sample(global_social_media$mean_time_on_social, size = 500) #create a sampling distribution of the mean with 500 samples
#fill in the code below ! 
mean_sample_100_df <- data.frame(sample_100 = mean_sample_100) #create a dataframe with a column called sample_100 that takes values from our vector

mean_sample_250_df <- data.frame(sample_250 = mean_sample_250) #create a dataframe with a column called sample_250 that takes values from our vector

mean_sample_350_df <- data.frame(sample_350 = mean_sample_350) #create a dataframe with a column called sample_350 that takes values from our vector

mean_sample_500_df <- data.frame(sample_500 = mean_sample_500) #create a dataframe with a column called sample_500 that takes values from our vector

Well done! This was a hard activity. If you are struggling please ask your tutor for help.

Figure 3: Ask for help !


Next we need to visualise all of these new samples. It is important we do this so we can see what happens to our sampling distribution of the mean as we increase the number of sample means.

We can do this by repeating the code for each histogram and giving it a new fill colour! Note there are of course much cleaner ways to do this, which we will go through in future weeks. If you are interested, please see the guide: Using Facets in ggplot.
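For the curious, here is a rough sketch of that facet approach (a preview only, you are not expected to follow it yet). It stacks the four samples into one dataframe with a labelling column, then facet_wrap() draws one panel per sample size. This assumes the mean_sample_* objects from Activity 2 are in your environment:

```r
all_samples_df <- rbind( #rbind() stacks dataframes on top of each other
  data.frame(n_means = "100 means", sample_mean = mean_sample_100),
  data.frame(n_means = "250 means", sample_mean = mean_sample_250),
  data.frame(n_means = "350 means", sample_mean = mean_sample_350),
  data.frame(n_means = "500 means", sample_mean = mean_sample_500)
)

all_samples_df %>% 
  ggplot(aes(x = sample_mean)) +
  geom_histogram(fill = "skyblue", colour = "black") +
  facet_wrap(~ n_means) + #one panel per value of the n_means column
  labs(x = "Time on Social media", y = "Count")
```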

mean_sample_100_df %>% 
ggplot(aes(x = sample_100)) +
  geom_histogram(fill = "red", colour = "black") +
    labs(x = "Time on Social media", y = "Count")

mean_sample_250_df %>% 
ggplot(aes(x = sample_250)) +
  geom_histogram(fill = "blue", colour = "black") +
    labs(x = "Time on Social media", y = "Count")

mean_sample_350_df %>% 
ggplot(aes(x = sample_350)) +
  geom_histogram(fill = "green", colour = "black")+
    labs(x = "Time on Social media", y = "Count")

mean_sample_500_df %>% 
ggplot(aes(x = sample_500)) +
  geom_histogram(fill = "orange", colour = "black") +
    labs(x = "Time on Social media", y = "Count")

How did the histogram change? Is it what you expected? Discuss this with your neighbours and your tutor.

Now imagine that there were infinite universities that ran this experiment, that each collected a group of people’s time_on_social scores, and each gave us their sample mean value. That is the theoretical sampling distribution of the mean.

The change in the shape of the histogram we have observed here is a critical implication of the central limit theorem: with a large number of sample means, the sampling distribution of the mean becomes approximately normal, regardless of the shape of the actual population distribution.

Now that's all well and good, but what does this actually mean? Why do you think this matters for the statistics we do? Discuss this with your neighbour and tutors.


Extension - What happens to the sampling distribution of the mean for other population distributions?

This section is an extension activity if you have already finished the required materials. Please check with your tutor that you have a good grasp of the material before moving onto this section.

Figure 4: Extension students be like


Now let's get into it. Let's see what happens when we use an exponential population distribution and find its sampling distribution of the mean with a large number of samples.

First, we are going to generate the population distribution. This can be done easily using the function rexp(), which generates random values from an exponential distribution.

# Generate an exponential population distribution
population <- data.frame(value = rexp(100000, rate = 1)) #generate an exponential distribution with 100,000 datapoints. 

Next let's generate a histogram of this population distribution to confirm that it looks like an exponential distribution.

# Plot population distribution
ggplot(population, aes(x = value)) +
  geom_histogram( bins = 100, fill = "skyblue", color = "black") +
  labs(
       x = "Value",
       y = "Frequency") +
  theme_classic() #themes can be provided to ggplot which give it a bunch of aesthetics to change. One of these is theme_classic

Do you think this looks like an exponential distribution? What should an exponential distribution look like? Ask your tutor if you are not sure!

Now we can take samples from this population. We use a new function here called replicate(), which repeats the process inside the {} brackets a specified number of times. Inside the {} brackets is what we actually do to the population distribution: first we take a sample using the sample() function, then we take the mean of that sample using the mean() function. This generates a similar set of data to PSYC2001_global-time-on-social-data.csv. That is, the means of a bunch of samples from the population (i.e. the sampling distribution of the mean).

# Lets take 500 samples from this population of size 50 per sample and calculate the mean. 
sample_means <- replicate(500, { #replicate the process 500 times
  sample_values <- sample(population$value, size = 50, replace = TRUE) # sample 50 values from the population
  mean(sample_values) #take the mean of those 50 sampled values
})

So what we have generated is a sampling distribution of the mean with 500 samples. Let's first convert that into a dataframe so that we can use ggplot to visualise this data.

# Put results in a dataframe for plotting
sampling_df <- data.frame(sample_mean = sample_means)

#plot the results
sampling_df %>% 
ggplot(aes(x = sample_mean)) +
  geom_histogram(fill = "skyblue", color = "black") +
  labs(
       x = "Sample Mean",
       y = "Frequency") +
  theme_classic()

What is the shape of this distribution? Is it different to the population distribution from above?

Finally, and most importantly, what are the implications of what we have shown here?
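One concrete way to see the point (assuming the sample_means vector from above is still in your environment): for an exponential distribution with rate = 1, the population mean and the population standard deviation are both exactly 1, so the sampling distribution of the mean should centre near 1 with a spread of about 1/sqrt(50) ≈ 0.14. A quick check:

```r
mean(sample_means) #should be close to the population mean of 1
sd(sample_means)   #should be close to the theoretical standard error below
1 / sqrt(50)       #the theoretical standard error: population SD / sqrt(sample size)
```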